THE FULL NETWORK ARCHITECTURE OF THE WAYBACK MACHINE ASSIGNS SPECIFIC TASKS TO
THE VARIOUS COMPUTERS WHICH SERVE THE DATA. EACH MACHINE PERFORMS ONE OF THREE
DISTINCT TASKS: IT SERVES AS A CGI MACHINE, A wb_server HOST, OR A wb_tcp HOST.
ALL THREE FUNCTIONS ARE PERFORMED BY ONE MACHINE IF THE "simple" ARCHITECTURE
IS USED (AS OPPOSED TO THE "network" ARCHITECTURE). HERE ARE A FEW POINTERS FOR
CONFIGURING AND RUNNING THE WAYBACK MACHINE.

I. CGI MACHINE 

1. LOCATION OF WAYBACK TREE: /alexa/apache/vhosts/archive

2. LOCATION OF APACHE CONFIG FILE: /alexa/apache/conf/httpd.conf

THIS FILE NEEDS TO BE EDITED IF THE NAME OF THE CGI MACHINE CHANGES.
THE FIELDS THAT MUST BE EDITED ARE THE "ServerName" APACHE FIELD AND
THE "RedirectMatch" FIELDS IN THE VIRTUAL HOST DIRECTIVE.

3. STARTING APACHE: /alexa/apache/bin/apachectl (start | stop | graceful | restart)

RUN AS ROOT. BECAUSE THE WAYBACK MACHINE RUNS UNDER MOD_PERL, APACHE
SHOULD BE RESTARTED AFTER ANY CHANGES ARE MADE TO THE CODE. THIS ISN'T
STRICTLY NECESSARY, BUT SINCE APACHE CACHES THE CGI SCRIPTS, CODE CHANGES
WON'T BE SEEN UNTIL EACH APACHE PROCESS HAS DIED AND BEEN RESTARTED.

4. APACHE LOGS: /alexa/apache/logs -> /export/logs

BOTH ERROR AND ACCESS LOGS WILL BE LOCATED HERE. 

5. WAYBACK CACHE LOCATION: /alexa/apache/vhosts/archive/live_dir
                           /alexa/apache/vhosts/archive/db_dir

THE CACHE OF DOCUMENTS RETRIEVED FROM THE LIVE WEB AND ARCHIVE WILL
BE STORED IN THESE TWO DIRECTORIES. THE IN-PROGRESS ARC FILES WILL
RESIDE IN live_dir, AND THE LOOK-UP TABLES (DBM FILES) IN db_dir.

THESE DIRECTORIES MUST BE OWNED AND WRITEABLE BY THE APACHE USER. 

6. CONFIGURATION FILE: /alexa/apache/vhosts/archive/cgi-bin/wayback.cgi

THIS IS THE FILE WHICH CONFIGURES THE WAYBACK MACHINE AND EXECUTES
EACH INCOMING REQUEST. TURN THE FOLLOWING OPTIONS ON BY ASSIGNING
'1' AND OFF BY ASSIGNING 'undef'. WHERE THESE OPTIONS DON'T MAKE
SENSE, USE 'undef' FOR THE DEFAULT BEHAVIOR (EXAMPLE: $MAX_N_RECORDS).
A DESCRIPTION SHOULD ACCOMPANY EACH OF THESE VARIABLES EARLIER IN
THE FILE.






my $DEBUG                  = 0;          # '1' FOR DEBUG MESSAGES, '0' ELSE
my $TIME                   = 0;          # '1' FOR TIMING MESSAGES, '0' ELSE
my $DEFAULT_COLLECTION     = undef;      # COLLECTION IF NONE PROVIDED IN URL
my $DISABLE_EXCLUDE_FILE   = 1;          # '1' TO DISABLE EXCLUDE FILES
my $DISABLE_LIVE_WEB       = 1;          # '1' TO DISABLE LIVE WEB RETRIEVALS
my $DISABLE_REDIRECTS      = undef;      # '1' TO DISABLE REDIRECTS
my $DISABLE_ROBOTS         = undef;      # '1' TO DISABLE ROBOTS RULES
my $DO_ROBOTS_IP_LOOKUPS   = undef;      # '1' TO FIND ROBOTS IP ADDRESSES
my $ENFORCE_USER_AGREEMENT = undef;      # '1' TO ENFORCE A USER AGREEMENT
my $MAX_N_REDIRECTS        = undef;      # NUMBER OF REDIRECTS TO FOLLOW
my $MAX_N_RECORDS          = undef;      # NUMBER OF QUERY RECORDS TO RETURN
my $MAX_ROBOTS_DURATION    = 84600;      # NUMBER OF SECONDS TO USE ROBOTS
my $MODE                   = "network";  # "network" FOR FULL ARCHITECTURE
my $NETWORK_REQUEST_MODE   = undef;      # pops FOR pops "network"
my $USE_WEB_FOR_ROBOTS     = undef;      # '1' TO GET ROBOTS FROM WEB
my $USE_BANNER             = undef;      # '1' TO USE A BANNER
my $URL_PURIFY_DIRECTORY   = undef;      # DIRECTORY WITH URL PURIFY FILES

BECAUSE THE WAYBACK MACHINE RUNS UNDER MOD_PERL, APACHE SHOULD BE
RESTARTED AFTER ANY CHANGES TO THE CODE.

7. NETWORK wb_server CONFIG FILE:

/alexa/apache/vhosts/archive/cgi-bin/wb_network_alpha.conf

THE CGI MACHINE READS THIS FILE TO DETERMINE WHICH wb_server TO QUERY
FOR THE REQUESTED DATA. IF THE NAME OF THE wb_server HOST CHANGES OR
wb_server HOSTS ARE ADDED, THIS FILE MUST BE EDITED TO REFLECT
THESE CHANGES.


8. PERL LIBRARIES: 


a. BUILD PERL WITH LARGE-FILE SUPPORT

b. INSTALL THE FOLLOWING MODULES (OR MORE RECENT VERSIONS OF THEM):

1. Time-HiRes-01.20
2. MIME-Base64-2.12
3. URI-1.18
4. HTML-Tagset-3.03
5. HTML-Parser-3.25
6. Compress-Zlib-1.16
7. libnet-1.10
8. Digest-MD5-2.16
9. libwww-perl-5.64
10. Net-DNS-0.14
11. DB_File-1.803
12. mod_perl-1.26

II. wb_server HOST

1. wb_server TREE: /alexa/wb_server

2. CONFIG FILE: /alexa/wb_server/wb_server.conf

3. STARTING THE SERVER: /alexa/wb_server/wb_server_daemon (start | stop | restart)

RUN AS ROOT.

4. PATH FILE NOTE:

IF THE NAME OF THE wb_tcp HOST CHANGES, OR IF ANY ARC FILES ON
THE wb_tcp HOST ARE MOVED, THE PATH FILE MUST BE MODIFIED/REGENERATED TO
REFLECT THE NEW NETWORK PATH TO THE ARC FILES.

III. wb_tcp HOST 


1. wb_tcp TREE: /alexa/wb_tcp 

2. STARTING THE SERVER: /alexa/wb_tcp/wb_tcp_daemon (start | stop | restart)

RUN AS ROOT. THIS SERVER SHOULD START AT BOOT TIME.

3. wb_tcp NOTE:

THE wb_tcp SERVER WILL START AT BOOT TIME. IT ANSWERS REQUESTS FROM THE
wb_server FOR DOCUMENTS, RETRIEVES THEM FROM ARC FILES, AND RETURNS
THEM SO THAT THE wb_server CAN RETURN ARCHIVED DOCUMENTS TO THE CGI
MACHINE.
DocumentRoot
    cgi-bin
        wayback.cgi
        fetch_list_ng.cgi
    collections
        Collection1.html
        Collection1_dir
            images
            styles
            search_results.html
            no_matches.html
            error_page.html
            searching_notice.html
        Collection2.html
        Collection2_dir
    db_dir
        lock.txt
        recent_urls.db
        recent_robots.db
    images
    live_dir
        arc_in_progress.txt
        ind_in_progress.txt
        rob_in_progress.txt
    styles


The simplest way to configure the web server (Apache) is to use a
virtual host directive.

<VirtualHost *>
    ServerName archive.alexa.com
    ServerAdmin ...
    DocumentRoot /alexa/apache/vhosts/archive
    CustomLog /export/logs/archive-access_log alexa

    # WAYBACK MACHINE
    # THE ORDER OF THE FOLLOWING ALIASES IS IMPORTANT.

    AliasMatch ^$ /alexa/apache/vhosts/archive/index.html
    AliasMatch ^/$ /alexa/apache/vhosts/archive/index.html

    RedirectMatch ^/e2k/*$ http://archive.alexa.com/collections/e2k.html
    RedirectMatch ^/web/*$ http://archive.alexa.com/collections/web.html

    Alias /wayback /alexa/apache/vhosts/archive
    Alias /collections /alexa/apache/vhosts/archive/collections
    Alias /images /alexa/apache/vhosts/archive/images
    Alias /robots.txt /alexa/apache/vhosts/archive/archive_notices/robots.txt

    ScriptAlias /archive_request_ng /alexa/apache/vhosts/archive/cgi-bin/fetch_list_ng.cgi

    # SEND URLS TO THE WAYBACK MACHINE SCRIPT.

    ScriptAliasMatch /(web\/.+) /alexa/apache/vhosts/archive/cgi-bin/wayback.cgi/$1
    ScriptAliasMatch /(e2k\/.+) /alexa/apache/vhosts/archive/cgi-bin/wayback.cgi/$1
    ScriptAliasMatch /(.+) /alexa/apache/vhosts/archive/cgi-bin/wayback.cgi/$1

</VirtualHost>

We strongly encourage the use of mod_perl when running the Wayback Machine,
as WayBack.pm is a large and complex Perl module, and the overhead of the
Perl interpreter is significant when launched on each request. If you have
mod_perl installed, include the following directive in your virtual host:

<IfModule mod_perl.c>

    <Location "/archive_request_ng">
        SetHandler perl-script
        PerlHandler Apache::Registry
        Options ExecCGI
        allow from all
        PerlSendHeader On
        PerlSetEnv CGI_NAME "fetch_list_ng"
    </Location>

    <Location "/$ARC_URL_PREFIX">
        SetHandler perl-script
        PerlHandler Apache::Registry
        Options ExecCGI
        allow from all
        PerlSendHeader On
        PerlSetEnv CGI_NAME "WAYBACK MACHINE"
    </Location>

and/or

    <Location "/collection1">
        SetHandler perl-script
        PerlHandler Apache::Registry
        Options ExecCGI
        allow from all
        PerlSendHeader On
        PerlSetEnv CGI_NAME "WAYBACK MACHINE"
    </Location>

    <Location "/collectionN">
        SetHandler perl-script
        PerlHandler Apache::Registry
        Options ExecCGI
        allow from all
        PerlSendHeader On
        PerlSetEnv CGI_NAME "WAYBACK MACHINE"
    </Location>

    PerlWarn On
    PerlTaintCheck On

</IfModule>

############################################################################# 
# WAYBACK MACHINE CONFIGURATION #
#############################################################################

The Wayback Machine Perl module, WayBack.pm, exports a number of configuration
methods. If used, they must be employed before handling any requests. Only
the configure and build_cdx_hash methods **MUST** be called, and
build_cdx_hash need not be called if using the "network" architecture. Here
is a list of the configuration methods exported by WayBack.pm:


disable_live_web ();        # Do not retrieve documents from the live web.
                            # The default behavior is to retrieve requested
                            # documents from the live web if they do not
                            # exist in the collection.

disable_redirects ();       # Do not return redirects with alternate dates.
                            # The default behavior is to return a redirect if
                            # the date of the document to be returned does not
                            # exactly match the request date. This way, the
                            # user knows the date of the document returned.

disable_exclude_file ();    # Do not filter domains with an exclude file.
                            # The exclude file is used to filter out URLs
                            # for domains which have requested not to be
                            # archived. The default behavior is to use this
                            # file to exclude URLs.

disable_robots ();          # Do not filter documents with robots.txt rules.
                            # The default behavior is to filter URLs based
                            # on robots.txt rules.

enable_debug_msgs ();       # Turn on debugging. This will result in some
                            # messages being written to stderr.

enable_timing_msgs ();      # Turn on profiling. This aids in determining
                            # where most of the time is spent during a
                            # request.

use_web_for_robots ();      # Retrieve robots.txt documents from the web.
                            # The default behavior is to retrieve robots.txt
                            # documents from the archive.

do_robots_ip_lookups ();    # Look up IP addresses for robots.txt documents.
                            # If robots.txt documents are retrieved from the
                            # live web, they are archived for addition to the
                            # collection. IP addresses are part of the
                            # arc-file meta-data. IP lookups affect
                            # performance, so they are not done by default.

set_default_collection:

The name of the default data collection is "web". To change this name,
pass the new default collection to this method.

set_default_collection ($DEFAULT_COLLECTION);

set_max_robots_duration:

The robots.txt documents are cached in a dbm file when they are used. This
allows for speedier retrieval on the next request for this document. The
default time period to keep the robots.txt document cached is one week.
After this time period, a request for this document will require a new
retrieval from either the archive or the live web, depending on the robots
configuration. To change the default robots.txt caching duration, pass the
new time period, in seconds, to this method.

set_max_robots_duration ($MAX_ROBOTS_DURATION);
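
The caching rule described for robots.txt documents can be sketched as
follows. This is an illustration in Python, not code from WayBack.pm: the
class name RobotsCache and the fetch callback are hypothetical, and a plain
dictionary stands in for the dbm file.

```python
import time

ONE_WEEK = 7 * 24 * 60 * 60  # the documented default duration, in seconds


class RobotsCache:
    """Hypothetical stand-in for the dbm-backed robots.txt cache."""

    def __init__(self, max_duration=ONE_WEEK, clock=time.time):
        self.max_duration = max_duration
        self.clock = clock
        self.entries = {}  # host -> (time stored, robots.txt text)

    def get(self, host, fetch):
        """Return the cached robots.txt for host, refetching once expired."""
        now = self.clock()
        cached = self.entries.get(host)
        if cached is not None and now - cached[0] < self.max_duration:
            return cached[1]
        # Expired or never seen: retrieve again (archive or live web,
        # depending on configuration) and remember when it was stored.
        text = fetch(host)
        self.entries[host] = (now, text)
        return text
```

Passing a shorter max_duration plays the same role as the value handed to
set_max_robots_duration.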

set_max_n_redirects:

Many archived documents contain redirects to other urls, either because of
Location: url HTTP headers, content refreshes, or some other mechanism.

The default behavior is to follow 3 redirects and then fail. To change the
number of redirects to follow, pass the value to this method.

set_max_n_redirects ($MAX_N_REDIRECTS);
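
As a rough illustration of "follow N redirects and then fail", here is a
Python sketch. It is not taken from WayBack.pm: the function name resolve is
hypothetical, and a dictionary of url -> target stands in for whatever
redirect information the archived documents carry.

```python
def resolve(url, redirects, max_n_redirects=3):
    """Follow redirects from url, failing once the limit is exceeded."""
    followed = 0
    while url in redirects:
        if followed == max_n_redirects:
            raise RuntimeError("gave up after %d redirects" % followed)
        url = redirects[url]
        followed += 1
    return url
```

A chain of exactly three redirects resolves; a fourth redirect (or a loop)
raises an error instead of being followed.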

set_max_n_records:

There are many copies of some documents. We don't want to return 10,000
records to the user because it would burden their browser to have to load and
render such a page. The default number of records to return is 1000. To
change the maximum number of records to return, pass the value to this
method.

set_max_n_records ($MAX_N_RECORDS);

set_mode:

The Wayback Machine supports two architectures: "simple" and "network". In
the "simple" architecture the WayBack.pm module (along with supporting Perl
modules) constitutes the whole of the Wayback Machine. The "network"
architecture is employed by Alexa and should only be utilized if:

1. The archive is very large, utilizing a distributed cdx file system

2. Many different collections are supported

3. One has significant systems administration/operations resources

The "network" architecture is detailed in a technical specification on the
Alexa site, and is employed as a 4-tier server, utilizing many machines.

The default architecture, or mode, is "simple". To enable the "network"
architecture, pass that string to this method.

set_mode ($MODE);

configure:

This is the only configuration method which **MUST** be called in the Perl
cgi which imports WayBack.pm. This method must be passed all of the files
which will be used by the Wayback Machine as well as some information about
the archival URL it should expect to parse.

%COLLECTION_HASH is a hash of arrays. The hash keys are the possible
collections which can be specified in the archival URL, and the value of
each possible collection must be an array which specifies the files to use:

$NO_RESULT_FILE     # html template for unsuccessful query results
$RESULT_FILE        # html template for successful query results
$ERROR_FILE         # html template for error message
$SEARCHING_FILE     # html for "Searching the Archive" notice


Example: If we support a "web" collection and an "e2k" collection, we might
declare:

my %COLLECTION_HASH = (
    "web" => [
        "../collections/web/no_matches.html",
        "../collections/web/search_results.html",
        "../collections/web/error_page.html",
        "../collections/web/searching_notice.html",
    ],
    "e2k" => [
        "../collections/e2k/no_matches.html",
        "../collections/e2k/search_results.html",
        "../collections/e2k/error_page.html",
        "../collections/web/searching_notice.html",
    ],
);

@CACHE_FILES is a list of files used in the caching/archiving of data.
This list must include (in order) the following files:

$ARC_IN_PROGRESS_FILE       # arc-file to build from
                            # live-web-retrieved documents
$ROB_IN_PROGRESS_FILE       # arc-file to build from live-web-retrieved
                            # robots.txt documents
$DB_LOCK_FILE               # dummy file for flock file locking
$DBM_RECENT_URLS_FILE       # berkeley db in which are stored
                            # live-web-retrieved documents
$DBM_RECENT_ROBOTS_FILE     # berkeley db in which are stored recently
                            # retrieved robots.txt documents

An "undef" file entry disables this feature. An example declaration
might be:

my @CACHE_FILES = (
    "../live_dir/arc_in_progress.txt",
    "../live_dir/rob_in_progress.txt",
    "../db_dir/lock.txt",
    "../db_dir/recent_robots.db",
    "../db_dir/recent_urls.db",
);


@EXCLUDE_FILES is a list of files used for doing additional data
filtering. The necessary elements are:

$EXCLUDE_FILE       # sorted list of domains to exclude
$C_EXCLUDE_FILE     # sorted list of canonized domains to exclude

An "undef" file entry disables this feature. An example declaration might
be:

my @EXCLUDE_FILES = (
    "/net/arc42/0/CRAWL/name_excludes",
    "/net/arc42/0/CRAWL/c_name_excludes",
);


$ARC_URL_PREFIX is the URL path which signifies that the url is an
archival URL. The archival URL is detailed in a technical specification
on the Alexa site. Basically, if you configure your Apache web server to
recognize that urls such as http://yourdomain.com/wayback/... signify an
archival url, then you would want to set:

my $ARC_URL_PREFIX = "wayback/";

If all urls on that domain (maybe you're using virtual hosts) are to be
archival urls, set $ARC_URL_PREFIX = "".

%IMAGE_HASH is a hash which specifies the images to use to denote page
retrieval speeds. The hash values should be static URL paths.

my %IMAGE_HASH = (
    "fast" => "/wayback/images/fast.gif",
    "okay" => "/wayback/images/okay.gif",
    "slow" => "/wayback/images/slow.gif",
);


%PATH_HASH is a hash of arrays. It lists all of the collections served and
the path files which belong to each collection. This is used only with the
"simple" architecture. Example:

my %PATH_HASH = (
    "web" => [
        "/alexa/wayback/path/path.txt",
        "/alexa/wayback/path/path.txt",
    ],
    "e2k" => [
        "/alexa/wayback/path/e2k_path1.txt",
        "/alexa/wayback/path/e2k_path2.txt",
    ],
);

configure (\%COLLECTION_HASH,
           \@CACHE_FILES,
           \@EXCLUDE_FILES,
           $ARC_URL_PREFIX,
           \%IMAGE_HASH,
           \%PATH_HASH);

build_cdx_hash:

This method constructs the hash of cdx header information for all of the
cdx files used. It only applies to the "simple" architecture.

%CDX_HASH is a hash of arrays. It lists all of the collections served and
the cdx files which belong to each collection. This is used only with the
"simple" architecture.

my %CDX_HASH = (
    "web" => [
        "/alexa/wayback/cdx/web_cdx1.cdx",
        "/alexa/wayback/cdx/web_cdx2.cdx",
    ],
    "e2k" => [
        "/alexa/wayback/cdx/e2k_cdx1.cdx",
        "/alexa/wayback/cdx/e2k_cdx2.cdx",
    ],
);

Pass this hash to the build_cdx_hash method.

build_cdx_hash (\%CDX_HASH);

#############################################################################
# INSTALLATION #
#############################################################################
To install WayBack.pm, you should be able to issue the following commands: 

perl Makefile.PL
make
make test
make install


The Wayback Machine requires the following packages, available on CPAN:

DB_File;            # database management module
HTTP::Request;      # HTTP request utilities
LWP::UserAgent;     # LWP wrapper for HTTP request
Net::DNS;           # DNS utilities
Time::Local;        # date manipulation routines
Time::HiRes;        # benchmarking module
Fcntl ":flock";     # allow for file locking
Compress::Zlib;     # compression utilities


The Wayback Machine also requires the following Alexa Internet modules:

Binsearch;          # binary search utilities
urlPurify;          # URL purification utilities

Note that urlPurify requires three text files, name_canon.txt, url_clean.txt
and commerse_servers.txt, to be located in the directory /alexa/url. This
can be reconfigured in the urlPurify module with the set_canon_directory ()
method, but this has not been built into WayBack.pm.

#############################################################################
# END #

#############################################################################
#############################################################################

# README FOR ALEXA INTERNET WAYBACK MACHINE # 

############################################################################# 

This document describes the functionality, configuration, and installation
of the Wayback Machine, built by Alexa Internet. The Wayback Machine is an
interface to an archive of internet documents. It allows users to query the
archive for documents matching a URL and other optional criteria, retrieves
documents from the archive, and attempts to reconstruct and render the
document as it would have appeared when it was originally archived.

Documents can be filtered in several ways, including robots.txt exclusions,
and most of the behavior can be configured for various purposes.

Engineer and Author: Dave Sherfesee

Alexa Internet

#############################################################################

# ARCHIVE #

#############################################################################

The Wayback Machine is an interface to a collection of internet documents
(web pages, images, javascript files, etc.). These documents must be
stored in arc-file format, indexed in cdx-file format, and referenced in a
path file. Technical specifications for arc files and cdx files reside on the
Alexa Internet web site (http://www.alexa.com), but here is a brief
description.

Arc File: 

An arc file (archive file) is an ascii text file -- typically 100 Mbytes in 
size -- which contains the archived documents. An arc file is formatted as 
follows: 

meta-data line 
archived document 

meta-data line 
archived document 

meta-data line 
archived document 

where the "meta-data line" is a " " delimited line containing five strings:

URL IP-address Archive-date Content-type Archive-length

URL is the URL of the archived document; IP-address is the IP address of the
archived document; Archive-date is the 14-digit timestamp describing the
time at which the document was archived (yyyymmddhhmmss); Content-type is
the MIME type of the document; and Archive-length is the length (in bytes)
of the archived document. Each document is separated from the next meta-data
line by a newline '\n' character. Each arc file begins with a header
describing the arc-file version and format, and I encourage you to read the
Arc File Technical Specification on the Alexa Internet site.
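
To make the meta-data line concrete, here is a small Python sketch that
splits one such line into its five fields. It is an illustration only: the
names ArcMeta and parse_arc_metadata are hypothetical, and it assumes that
none of the five strings contains a space.

```python
from collections import namedtuple
from datetime import datetime

# One record per meta-data line; field order follows the description above.
ArcMeta = namedtuple(
    "ArcMeta", "url ip_address archive_date content_type archive_length"
)


def parse_arc_metadata(line):
    """Split a space-delimited arc meta-data line into typed fields."""
    url, ip_address, date, content_type, length = line.rstrip("\n").split(" ")
    return ArcMeta(
        url,
        ip_address,
        datetime.strptime(date, "%Y%m%d%H%M%S"),  # yyyymmddhhmmss
        content_type,
        int(length),  # length of the archived document, in bytes
    )
```

The archive_length value tells a reader how many bytes of document follow
the meta-data line before the next record begins.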

Alexa employs a proprietary compression scheme on the arc files, allowing 
for random access into the compressed file. This compression scheme must be 
used on the arc files for the "simple" Wayback architecture. The "network"
architecture also allows for uncompressed documents. Straight gzip is no 
longer supported due to performance costs. 

Cdx File:

A cdx file is an index file describing the content of the archive. The
Wayback Machine uses the cdx files to determine what documents exist in
the archive as well as their location in the archive. A cdx file is a sorted
file, each line containing a " " delimited record of an archived document.

The first line of a cdx file must be a cdx header, describing the format of 
the cdx file. There are many required fields in the cdx file, but they can 
be in any order, with the exception of the first field, which must be the 
canonized URL. Here is an example cdx header and record: 

CDX AbeamsckrVvDdgMn
somedomain.com/images/image1.jpg 20000305225258 000.000.000.000
www.somedomain.com:80/images/image.jpg image/jpeg 200
16d76ab0e3e2d38e1c4b8a5504339fe7 16d76ab0e3e2d38e1c4b8a5504339fe7 - 23163754
62630679 2687031 9530122 arc_file_path - 33895

The cdx file header has the format "CDX ????..." where each "?" represents
a character. The character describes the data in the corresponding column.

For instance, the cdx header above begins with the fields "A b e a m" which 
represent: 

A: canonized URL 
b: Archive-date 
e: IP-address 
a: original URL 
m: Content-type 

These fields, along with: 

s: HTTP Response Code 
c or k: checksum 
r: Redirect URL 

V or v: Compressed or uncompressed offset of document in arc file 
g: path to arc file 
n: content-length 

are required by the Wayback Machine. Once the applicable records for a URL
are located in a cdx file, documents may be retrieved from the arc files by
searching for the "path to arc file (g)" in a path file.
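
The relationship between the cdx header and the columns of a record can be
sketched in Python as follows. The function names are hypothetical and this
is not code from WayBack.pm; the sketch accepts the field codes written
either together ("CDX Abeam...") or separated by spaces.

```python
def parse_cdx_header(header):
    """Return the list of one-letter field codes from a cdx header line."""
    parts = header.split()
    if not parts or parts[0].upper() != "CDX":
        raise ValueError("not a cdx header: %r" % header)
    # Join whatever follows "CDX" so both spellings give the same codes.
    return list("".join(parts[1:]))


def parse_cdx_record(fields, line):
    """Map each field code to the matching column of one record line."""
    values = line.split()
    if len(values) != len(fields):
        raise ValueError(
            "expected %d columns, got %d" % (len(fields), len(values))
        )
    return dict(zip(fields, values))
```

With the header shown above, record["g"] is the arc file path to look up in
the path file and record["n"] is the content-length.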

Cdx files must be uncompressed in the "simple" Wayback architecture. The
"network" architecture supports Alexa compression, but this is not recommended 
if performance is an issue. Straight gzip is not supported. 

Path File: 

A path file is simply a sorted ascii text file, each line being a " " delimited 
record of the arc files in the collection. Each line has the format: 

path-to-arc-file network-path /dev/null 

path-to-arc-file is the string found in the cdx file, network-path is the full 
path to the arc file on the network, and /dev/null is the path to another file 
(used internally by Alexa, but not necessary here). Path files are used so 
that arc files can be relocated within the network without having to recreate 
the cdx file. Path file recreation is simple and fast. 

Path files must be uncompressed. 
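
Because a path file is sorted, the lookup can be a binary search. The Python
sketch below illustrates the idea on a list of lines already read into
memory; the function name and the sample host paths in the usage are
hypothetical, and the Binsearch module mentioned elsewhere in this document
is not reproduced here.

```python
import bisect


def find_network_path(path_lines, arc_path):
    """Return the network path recorded for arc_path, or None if absent."""
    # First column of every line is the path-to-arc-file key. Building the
    # key list is linear; a real implementation would search the file itself.
    keys = [line.split(" ", 1)[0] for line in path_lines]
    i = bisect.bisect_left(keys, arc_path)
    if i < len(keys) and keys[i] == arc_path:
        return path_lines[i].split(" ")[1]
    return None
```

For example, given the line "arc_0002 /net/host2/0/arc_0002.arc /dev/null",
looking up "arc_0002" yields "/net/host2/0/arc_0002.arc".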

In general, then, a typical document retrieval goes as follows:

1. Search the cdx file for the record that best matches the search criteria.

2. Use the arc file path in the cdx record to search the path file.

3. Locate the arc file with the full network path found in the path file.

4. Retrieve the archived document using the offset and content-length
data from the cdx record.
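
Steps 2 through 4 can be sketched in Python as below. This is a simplified
illustration under stated assumptions, not the WayBack.pm implementation:
the arc file is assumed to be uncompressed, the cdx record is assumed to be
a dictionary keyed by field code, "v" is taken to be the byte offset at
which reading starts and "n" the number of bytes to read, and path_table is
a hypothetical dictionary built from the path file.

```python
def retrieve(record, path_table):
    """Read one archived document given its cdx record."""
    # Steps 2 and 3: map the arc file path from the cdx record (field g)
    # to the full network path listed in the path file.
    network_path = path_table[record["g"]]
    # Step 4: seek to the offset and read content-length bytes.
    with open(network_path, "rb") as arc:
        arc.seek(int(record["v"]))
        return arc.read(int(record["n"]))
```

Compressed arc files would need the decompression step that this sketch
leaves out.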


#############################################################################
# FUNCTIONALITY #

#############################################################################
Here are the methods WayBack.pm exports for request handling and profiling.

Request Handling:

parse_archival_url ();              # Parse the archival URL. This method
                                    # retrieves data from the archival URL
                                    # and assigns it to the wayback object.

purify_url ();                      # This routine purifies and canonizes
                                    # the url contained in $self->{URL} and
                                    # assigns the results to
                                    # $self->{PURE_URL} and
                                    # $self->{CANON_URL}.

handle_request (\$output);          # This routine handles the request,
                                    # filling $output with the response
                                    # which is to be returned to the client.

print_wayback_values ();            # This routine writes the values stored
                                    # in the wayback object.

handle (method_call);               # This routine is used to wrap method
                                    # calls, trap errors, and detect output.

Profiling:

note_method_start ("method_name");  # Note start of method "method_name".

note_method_end ("method_name");    # Note end of method "method_name".

write_timing_msgs ();               # Print the profiling data for the
                                    # request.

#############################################################################
# INTERFACE #

#############################################################################
Archival URL:

The Wayback Machine is accessed via the archival URL. The archival URL is
described in detail on the Alexa web site, but here is a brief description:

http://yourdomain.com/${ARC_URL_PREFIX}datespec(request_flags)/url

As described in the WAYBACK MACHINE CONFIGURATION section, $ARC_URL_PREFIX
is a string which can be used in your Apache (or other) web server
configuration file in order to recognize archival URLs. For instance, you
might designate all urls whose paths begin with wayback/ to be archival URLs.
In this case, your archival URL would look like:

http://yourdomain.com/wayback/datespec(request_flags)/url

If all yourdomain.com URLs are to be archival URLs, set $ARC_URL_PREFIX = "":

http://yourdomain.com/datespec(request_flags)/url

The datespec helps to describe the type of request being made. Briefly:

2000              Most recent document from the year 2000
20000101125555    Document closest to Jan 01, 2000 at 12:55:55
1999*             All documents from 1999
1999-2001         Document between 1999 and 2001
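
The four datespec forms listed above can be told apart with a few pattern
checks. The Python sketch below is an illustration only: the function name
and the returned labels are hypothetical, and treating any shorter-than-14
digit string as a "most recent" request is an assumption drawn from the
single "2000" example.

```python
import re


def classify_datespec(datespec):
    """Return a tuple naming the request type and its date component(s)."""
    if re.fullmatch(r"\d{1,14}\*", datespec):
        return ("all", datespec[:-1])            # e.g. 1999*
    match = re.fullmatch(r"(\d{1,14})-(\d{1,14})", datespec)
    if match:
        return ("range", match.group(1), match.group(2))  # e.g. 1999-2001
    if re.fullmatch(r"\d{14}", datespec):
        return ("closest", datespec)             # full yyyymmddhhmmss
    if re.fullmatch(r"\d{1,13}", datespec):
        return ("most_recent", datespec)         # e.g. 2000
    raise ValueError("unrecognized datespec: %r" % datespec)
```

Request flags, when present, would have to be stripped from the datespec
before it is classified this way.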

The request flags supported are built into the UI. Some of them are:

h_ Filter results based on exact hostname. 

ta - Place all results in a table, not just exact matches. 

sa - Show all results. Don't filter out duplicates.

Others will be implemented as the UI evolves. 

The url at the end of the archival URL is taken to be the request URL. If
there was no '*' in the datespec, this URL is retrieved from the archive.
If there was a '*' in the datespec, the URL can end in '*', in which case
the result will be a list of all URLs matching the datespec and the request
URL up to the '*'. For instance, an archival URL of

http://yourdomain.com/1999*/somedomain.com/*

might return the following URLs:

somedomain.com/
somedomain.com/images/
somedomain.com/images/image1.jpg

since all of these URLs begin with "somedomain.com/".
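
Since cdx records are sorted by canonized URL, a trailing '*' amounts to a
prefix scan over a sorted list. The following Python sketch shows that idea;
the function name is hypothetical, the URL list stands in for the first
column of a cdx file, and date filtering is left out.

```python
import bisect


def urls_with_prefix(sorted_urls, request_url):
    """List the URLs matching request_url, honoring a trailing '*'."""
    if not request_url.endswith("*"):
        return [request_url] if request_url in sorted_urls else []
    prefix = request_url[:-1]
    # Jump to the first URL that could match, then walk forward until the
    # prefix stops matching; sorting guarantees the matches are contiguous.
    start = bisect.bisect_left(sorted_urls, prefix)
    matches = []
    for url in sorted_urls[start:]:
        if not url.startswith(prefix):
            break
        matches.append(url)
    return matches
```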

Files:

There are 4 files required by the Wayback Machine for the UI, and they are
described in the WAYBACK MACHINE CONFIGURATION section. They are:

$NO_RESULT_FILE     # html template for unsuccessful query results
$RESULT_FILE        # html template for successful query results
$ERROR_FILE         # html template for error message
$SEARCHING_FILE     # html for "Searching the Archive" notice


The Wayback Machine inserts the request output into these templates and
returns them to the client. These templates must include the appropriate
strings for data substitution. You should investigate the example UI files
provided and look for strings like URL_TO_SUBSTITUTE, NUMBER_TO_SUBSTITUTE,
and LIST_TO_SUBSTITUTE.
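
The substitution idea can be illustrated with a short Python sketch. How
WayBack.pm actually fills its templates is not described here, so the
function below is hypothetical; it only shows a marker string such as
LIST_TO_SUBSTITUTE being replaced by generated HTML.

```python
import html


def fill_list_template(template, urls, marker="LIST_TO_SUBSTITUTE"):
    """Replace the marker in an html template with one list item per URL."""
    # Escape each URL so characters like '&' are safe inside the page.
    items = "\n".join("<li>%s</li>" % html.escape(url) for url in urls)
    return template.replace(marker, items)
```

A template that lacks the marker is returned unchanged, which is one reason
to check the example UI files for the exact strings they expect.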


#############################################################################
# APACHE CONFIGURATION #

#############################################################################

Some web server configuration is required for the Wayback Machine. The 
default UI is set up as follows: 

Collection Front Pages: 

http://yourdomain.com/collections/somecollection.html

Example:

http://yourdomain.com/collections/web.html
http://yourdomain.com/collections/e2k.html

Directory Structure: 